(57) A well-known problem in state-of-the-art speech recognition
systems is that the vocabulary often contains pairs of words that
are very similar to each other and thus confusable. This may cause
errors in the recognition phase and thus decrease recognition rates.
If unsupervised speaker adaptation is used in such a system, these
misrecognitions may cause adaptation of the wrong models and thus a
further decrease in performance. Therefore, according to the present
invention, confusable words within the vocabulary are marked, and an
adaptation of the system to a certain user with such marked words is
only performed in case of a positive confirmation of the recognition
result by the user.
Description

[0001] This invention is related to a method to perform 
an adaptation of an automatic speech recognition sys- 
tem, in particular to a method to prevent an adaptation 
of the wrong models in a speech recognition system. 
[0002] State-of-the-art speech recognizers consist of a set of
statistical distributions modeling the acoustic properties of
certain speech segments. These acoustic properties are encoded in
feature vectors. As an example, one Gaussian distribution can be
taken for each phoneme. These distributions are attached to states.
A (stochastic) state transition network (usually hidden Markov
models) defines the probabilities for sequences of states and
sequences of feature vectors. Passing a state consumes one feature
vector covering a frame of e.g. 10 ms of the speech signal.
[0003] The stochastic parameters of such a recogniz- 
er are trained using a large amount of speech data either 
from a single speaker yielding a speaker dependent 
(SD) system or from many speakers yielding a speaker 
independent (SI) system. 

[0004] Speaker adaptation (SA) is a widely used method to increase
recognition rates of SI systems. State-of-the-art speaker dependent
systems yield much higher recognition rates than speaker independent
systems. However, for many applications, it is not feasible to
gather enough data from a single speaker to train the system. In the
case of a consumer device this might not even be wanted. To overcome
this mismatch in recognition rates, speaker adaptation algorithms
are widely used in order to achieve recognition rates that come
close to those of speaker dependent systems while using only a
fraction of the speaker dependent data that such systems require.
These systems initially take speaker independent models that are
then adapted so as to better match the speaker's acoustics.
[0005] Usually, the speaker adaptation is performed in supervised
mode. That is, the spoken words are known and the recognizer is
forced to recognize them. In this way a time alignment of the
segment-specific distributions is achieved. The mismatch between the
actual feature vectors and the parameters of the corresponding
distribution forms the basis for the adaptation. Supervised
adaptation requires an adaptation session to be done with every new
speaker before he/she can actually use the recognizer.

[0006] Usually, the speaker adaptation techniques modify the
parameters of the hidden Markov models so that they better match the
new speaker's acoustics. Normally, in batch or off-line adaptation a
speaker has to read a pre-defined text before he/she can use the
system for recognition, and this text is then processed to do the
adaptation. Once this is finished the system can be used for
recognition. This mode is also called supervised adaptation, since
the text is known to the system and a forced alignment of the
corresponding speech signal to the models corresponding to the text
is performed and used for adaptation.

[0007] However, an unsupervised or on-line method is better suited
for most kinds of consumer devices. In this case, adaptation takes
place while the system is in use. The recognized utterance is used
for adaptation and the modified models are used for recognizing the
next utterance and so on. In this case the spoken text is not known
to the system, but the word(s) that were recognized are taken
instead.
[0008] An adaptation of the speaker adapted model set can be
repeatedly performed to further improve the performance for specific
speakers. There are several existing methods for speaker adaptation,
e.g. maximum a posteriori (MAP) adaptation or maximum likelihood
linear regression (MLLR) adaptation.
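
As an illustration of one of the methods named above, the mean update commonly used in MAP adaptation can be sketched as follows. This is a minimal sketch and is not taken from this document: the function name `map_adapt_mean`, the prior weight `tau` and its default value are assumptions, and only a single Gaussian mean is adapted (variances and mixture weights are left untouched).

```python
def map_adapt_mean(prior_mean, frames, occupancies, tau=10.0):
    """MAP update of one Gaussian mean vector (illustrative sketch).

    prior_mean  : speaker-independent mean (list of floats)
    frames      : feature vectors aligned to this Gaussian
    occupancies : per-frame occupation weights (1.0 for a hard alignment)
    tau         : prior weight (assumed value); larger means slower adaptation
    """
    dim = len(prior_mean)
    total = sum(occupancies)
    weighted_sum = [0.0] * dim
    for x, gamma in zip(frames, occupancies):
        for d in range(dim):
            weighted_sum[d] += gamma * x[d]
    # Interpolate between the prior mean and the observed data.
    return [(tau * prior_mean[d] + weighted_sum[d]) / (tau + total)
            for d in range(dim)]
```

With few adaptation frames the result stays close to the speaker-independent mean; with many frames it approaches the mean of the speaker's own data. This also shows why adapting with frames from a misrecognized word is harmful: the wrong mean is pulled toward data that does not belong to it.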

[0009] For speech recognition systems the problem often arises that
the vocabulary comprises many words that sound similar. As a
consequence it is often difficult to distinguish between these words
and this often causes misrecognitions. If a system uses unsupervised
speaker adaptation to improve its models for particular speakers,
these misrecognitions may lead adaptation in the wrong direction and
this may have an adverse effect on the recognition rates, since then
the wrong models are modified.

[0010] State-of-the-art speech recognition systems 
try to resolve ambiguities using grammars and language 
models that define a structure of valid sentences, so that 
in some cases ambiguities can be resolved by this. 
[0011] Another method, disclosed in EP 0 763 812 A1, is the use of
verification methods to reduce the confusability of certain words.
It is a mathematical approach in which confidence measures are used
for verification of n-best recognized word strings. The result of
this verification procedure (the derivative of the loss function) is
used as an optimization criterion for HMM training prior to the use
of the system. In this case, all utterances are used for training
and the method is used to maximize the difference in the likelihood
of confusable words.
[0012] However, in speech recognition systems using supervised or,
especially, unsupervised adaptation, misrecognitions can occur so
that the wrong HMMs are adapted. If this happens repeatedly,
recognition performance may decrease drastically.
[0013] Therefore, it is the object underlying the present invention
to propose a method for adaptation that overcomes the problems
described above.
[0014] The inventive method is defined in independent claim 1.
Preferred embodiments thereof are defined in the respective
following dependent claims.

[0015] This problem is solved by avoiding adaptation based on a
misrecognized word if this word is confusable, e.g. highly
confusable, with other words.
[0016] According to the inventive method, the speech recognition
system is made aware of such highly confusable words and, if such a
word is recognized, double-checks the recognition result by asking
the user for confirmation. Only when the system can be sure that
such a word was recognized correctly will it be used for adaptation.

[0017] Therefore, prior to the recognition phase, it is determined
which words in the vocabulary are highly confusable with other
words. This is e.g. done by comparing words and computing the number
of differing phonemes in relation to the total number of phonemes of
the word. Another possibility might be to use sets of template
speech signals representing all words in the vocabulary and then to
compute a distance between these words. Such templates can
preferably be Hidden Markov Models. Of course the determination of
the confusability is not limited to these methods.
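
The phoneme-based comparison can be sketched as follows. This is only one possible reading of the text: counting differing phonemes with a Levenshtein (edit) distance, normalizing by the phoneme count of the longer word, the threshold value, and all function names are assumptions made for illustration.

```python
def phoneme_edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def differing_ratio(a, b):
    """Differing phonemes relative to the phoneme count of the longer word."""
    longest = max(len(a), len(b))
    return phoneme_edit_distance(a, b) / longest if longest else 0.0


def is_highly_confusable(a, b, threshold=0.34):
    """Assumed rule: a small ratio of differing phonemes means confusable."""
    return differing_ratio(a, b) <= threshold


# Hypothetical phoneme transcriptions, for illustration only.
forty = ["f", "ao", "r", "t", "iy"]
fourteen = ["f", "ao", "r", "t", "iy", "n"]
print(is_highly_confusable(forty, fourteen))
```

A distance between HMM word templates, the second possibility mentioned in the text, could be plugged in at the same place by replacing `differing_ratio`.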
[0018] Also, a grade of confusability of a certain word contained in
the vocabulary to the other words of the vocabulary can be
determined. This can be done manually or automatically using
well-known similarity measures for phoneme strings and/or HMMs. In
this case not only highly confusable words, but also words that are
confusable at all, have to be confirmed, or may also or instead be
processed with other verification technologies.

[0019] In any case, for each word in the vocabulary it is known
whether and with which word(s) it is confusable and the grade of
this respective confusability. If during recognition one of the
words that was previously marked as being confusable, e.g. as highly
confusable, is recognized, the user is asked to confirm the
recognition result and, in case it was misrecognized, to repeat or
spell it (if the user interface comprises a keyboard he/she could
also type it; other input modalities are also suitable for
correction purposes). After that is done the system can use the
speech signal of the previously misrecognized word, for which it now
knows the correct word, for adaptation. If the word was not a
confusable one, no confirmation from the user is needed, but other
methods to verify the reliability of the recognition results may be
applied.

[0020] As a result of the inventive method the confusability of
generally confusable words will decline, because the right models
are always adapted, and thus the discrimination of highly confusable
words should become easier.

[0021] The inventive method to perform an adaptation of an automatic
speech recognition system will be better understood from the
following detailed description of an exemplary embodiment thereof
taken in conjunction with the appended drawings, wherein:

Figure 1 shows the process of determination of con- 
fusability between words in the vocabulary 
according to the present invention; and 

Figure 2 shows the procedure to perform an adap- 
tation according to the present invention. 

[0022] Figure 1 shows the determination of confusability between
words in the vocabulary prior to the recognition phase. According to
the exemplary embodiment, it is determined here which words in the
vocabulary are highly confusable with other words.
[0023] After the start of the procedure in a step S0, it is checked
in a step S1 whether another word is to be processed. If no
additional word is to be processed, the procedure continues with
step S6 and ends. Otherwise the procedure continues with step S2, in
which a new word is added to the vocabulary. Thereafter, the
confusability of this new word with all other words already
contained in the vocabulary is computed in a step S3.

[0024] As mentioned above, there are several methods to compute the
confusability and also several possible results, i.e. several grades
of confusability.
[0025] In a step S4 it is determined whether this new word is highly
confusable with other words. However, the determination whether this
word is highly confusable with other words is no limitation of the
present invention. It could also be checked in this step S4 whether
the word is confusable with other words at all, and the grade of
confusability would then be passed on to the next step. In this
exemplary embodiment, however, a classification of the grade of
confusability is already performed in step S4.
[0026] If the new word is highly confusable with other words, this
word is marked as highly confusable in a step S5. Thereafter, the
procedure continues again with step S1, in which it is checked
whether another word should be processed. If the word is not
regarded as highly confusable with other words in step S4, the
procedure continues directly with step S1.
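
The procedure of Figure 1 can be sketched as a simple loop. This is a minimal sketch under stated assumptions: the function name `build_vocabulary`, the caller-supplied `confusability` score in the range 0 to 1, and the `high_threshold` value are hypothetical. Marking the already stored partner word as well as the new word is also an assumption; the text only states that the new word is marked in step S5.

```python
def build_vocabulary(words, confusability, high_threshold=0.8):
    """Add words one by one and mark the highly confusable ones.

    confusability(a, b) is a caller-supplied score in [0, 1]; higher
    means more confusable. Returns a dict that maps every word to the
    set of words it is highly confusable with (empty set: not marked).
    """
    marked = {}
    for new_word in words:                        # S1: another word?
        partners = set()
        for old_word in marked:                   # S3: compare with stored words
            if confusability(new_word, old_word) >= high_threshold:  # S4
                partners.add(old_word)
                marked[old_word].add(new_word)    # assumed: marking is mutual
        marked[new_word] = partners               # S2 (add) and S5 (mark)
    return marked                                 # S6: end


def toy_score(a, b):
    """Hypothetical score for illustration only."""
    return 1.0 if {a, b} == {"forty", "fourteen"} else 0.0


print(build_vocabulary(["forty", "stop", "fourteen"], toy_score))
```

Storing the partner words, rather than a single flag, keeps the information "if and with which word(s)" a word is confusable available at recognition time.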

[0027] Figure 2 shows the endless loop in which an adaptation
according to the present invention is performed during the
recognition process.
[0028] After the user has spoken an utterance to the system in a
step S9, a recognition of this utterance is performed in a step S10.
In a step S11 it is checked whether or not one of the recognized
words is highly confusable with other words (that are possible in
this context).

[0029] If the recognized word is regarded as highly confusable with
at least one other word of the vocabulary in step S11, the user is
asked for confirmation of the word in step S12. After the user's
confirmation an adaptation of the models is performed in step S13
and the next spoken utterance is received in step S9.

[0030] If the recognized word is not regarded as highly confusable
with any other words in step S11, any other verification method can
be applied to the word in step S14. Thereafter, it is checked in
step S15 whether or not the word was recognized correctly. If the
word was recognized correctly in step S15, an adaptation is
performed in step S13, whereafter the next spoken utterance is
received in step S9. If the word was not recognized correctly, the
next spoken utterance is directly received in step S9 without
performing the adaptation in step S13.
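
One pass through the loop of Figure 2 can be sketched as follows. This is a hedged illustration rather than the embodiment itself: the recognizer, the user dialogue, the alternative verification method and the model adaptation are passed in as hypothetical callables, a single-word utterance is assumed, and the correction dialogue (repeat, spell or type) after a negative confirmation is left out.

```python
def process_utterance(utterance, recognize, marked_confusable,
                      ask_user_to_confirm, verify, adapt):
    """Handle one utterance; return (recognized word, adaptation done?)."""
    word = recognize(utterance)                  # S10: recognition
    if word in marked_confusable:                # S11: marked as confusable?
        confirmed = ask_user_to_confirm(word)    # S12: ask the user
    else:
        confirmed = verify(utterance, word)      # S14/S15: other verification
    if confirmed:
        adapt(utterance, word)                   # S13: adapt the models
    return word, confirmed                       # caller returns to S9


# Usage with stand-in callables (all hypothetical):
adapted = []
print(process_utterance(
    "utt-1",
    recognize=lambda u: "forty",
    marked_confusable={"forty", "fourteen"},
    ask_user_to_confirm=lambda w: True,
    verify=lambda u, w: True,
    adapt=lambda u, w: adapted.append((u, w)),
))
```

The key property is that a word marked as confusable never reaches the adaptation step without a positive confirmation from the user.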
Claims

1. Method to perform an adaptation of an automatic
speech recognition system, characterized by the
following steps:

marking the words in the vocabulary of the 
speech recognition system that are confusable 
with other words prior to the recognition proc- 
ess; 

requesting a confirmation during the recogni- 
tion process when a word marked as confusa- 
ble is recognized; and 

performing an adaptation of the automatic speech
recognition system with recognized words
marked as confusable for which a positive
confirmation was given.

2. Method according to claim 1, characterized by the
following steps:

verifying recognized words not marked as
confusable; and

performing an adaptation of the automatic speech
recognition system with verified words.

3. Method according to claim 1 or 2, characterized by
the following step:

performing an adaptation of the automatic speech
recognition system with recognized words not
marked as confusable.

4. Method according to any one of claims 1 to 3,
characterized in that a word in the vocabulary of the
speech recognition system is marked as confusable
on the basis of a comparison and computation of
the number of differing phonemes in relation to the
total number of phonemes of the word.

5. Method according to any one of claims 1 to 4,
characterized in that a word in the vocabulary of the
speech recognition system is marked as confusable
on the basis of a computation of a distance measure
between all words in the vocabulary with the help
of a set of template speech signals representing all
words in the vocabulary.

6. Method according to claim 5, characterized in that
said set of template speech signals are Hidden
Markov Models.

7. Method according to any one of claims 1 to 6,
characterized in that in case of a negative confirmation
the user is asked to repeat the misrecognized word.

8. Method according to any one of claims 1 to 7,
characterized in that in case of a negative confirmation
the user is asked to spell the misrecognized word.

9. Method according to claim 7 or 8, characterized in
that the previously misrecognized word is used for
adaptation after its correct recognition on the basis
of the repetition and/or spelling.

10. Method according to any one of claims 1 to 9,
characterized in that in case of a negative confirmation
the user is asked to type in the misrecognized word.

11. Method according to any one of claims 1 to 10,
characterized in that the adaptation of the speech
recognition system is an adaptation of the speaker
independent Hidden Markov Models to speaker
adapted Hidden Markov Models.

12. Method according to claim 11, characterized in
that the adaptation method is a maximum a posteriori
adaptation or a maximum likelihood linear
regression adaptation.